1. Introduction

The World Happiness Report is a landmark survey of the state of global happiness that ranks 156 countries by how happy their citizens perceive themselves to be. The world is a rapidly changing place and happiness is something that everyone pursuits in the ever growing complicated world. The United Nations Sustainable Development Solutions Network has been publishing the World Happiness report since 2012. This first report presented the available global data on national happiness and reviewed related evidence from the emerging science of happiness, showing that the quality of people’s lives can be assessed by a variety of subjective well-being measures, collectively referred to then and in subsequent reports as “happiness.” Each report includes updated evaluations and a range of commissioned chapters on special topics digging deeper into the science of well-being, and on happiness in specific countries and regions. We will try to do an Exploratory Data Analysis of the 2019 World Happiness Report. It contains rankings of national happiness based on respondent ratings of their own lives, which the report also correlates with various life factors.

Research Questions:

Below are some questions we will expect to answer from our Analysis:

  1. What factors contribute most to happiness of the country?
  2. Which factor is the key contributor for happiness of a country?
  3. Which factor is the least contributor to happiness of a country?
  4. Which region / continent is the happiest?
  5. Are countries lying in a certain region happier than other regions?
  6. How do negative attributes like corruption and negative affect impact the happiness of the country?

2. Preparing the Data

Link to the dataset used for Analysis: https://www.kaggle.com/PromptCloudHQ/world-happiness-report-2019

The data did not require any prior pre-processing and was available in a fairly clean .csv format.

Explanation of columns and variables

  • Country (region): Name of the country.
  • Happiness Rank / Ladder: Cantril Ladder is a measure of life satisfaction.
  • SD of Ladder: Standard deviation of the ladder.
  • Positive affect: Measure of positive emotion.
  • Negative affect: Measure of negative emotion.
  • Social support: The extent to which Social support contributed to the calculation of the Happiness Score.
  • Freedom: The extent to which Freedom contributed to the calculation of the Happiness Score.
  • Corruption: The extent to which Perception of Corruption contributes to Happiness Score.
  • Generosity: The extent to which Generosity contributed to the calculation of the Happiness Score.
  • Log of GDP (per capita): The extent to which GDP contributes to the calculation of the Happiness Score.
  • Healthy life expectancy: The extent to which Life expectancy contributed to the calculation of the Happiness Score.

Understanding the ranks in the dataset

  • Positive Attributes (total 7)

  • There are 7 columns in the data set that indicate Positive Attributes: Happiness Rank / Ladder, Positive Affect, Social Support, Freedom, Generosity, Log of GDP, Healthy Life expectancy.

  • Lower number of these attributes indicate better score and greater impact on Happiness Score.

  • e.g. Singapore ranks # 1 in Healthy Life expectancy and the people think that it contributes a lot to their overall Happiness.

  • Negative Attributes (total 2)

  • There are 2 columns in the data set that indicate Negative Attributes: Negative Affect and Corruption.

  • Higher number of these attributes indicate worst score and greater negative impact on Happiness Score.

  • e.g. Moldova ranks # 148 in Corruption. This means that the people have indicated their Country has more corruption that affects their Happiness.

R Libraries used for Analysis

The R Libraries used for analysis are UsingR, plotly, ggplot2, corrplot, dplyr, tidyr, rworldmap, prob, sampling and RColorBrewer.

Steps to Import dataset into R

  1. Download the dataset from the source in .csv format and save it to the working directory:
  2. Import the dataset into a data frame from the working directory using the below R command: Note that we have renamed the columns of the data while importing into the data frame as part of cleaning process for the sake of simplicity.
  1. Verify if import was done correctly by looking at a few rows using head() function:
##       Country Happiness.Rank SD.of.Ladder Positive.Affect Negative.Affect
## 1     Finland              1            4              41              10
## 2     Denmark              2           13              24              26
## 3      Norway              3            8              16              29
## 4     Iceland              4            9               3               3
## 5 Netherlands              5            1              12              25
## 6 Switzerland              6           11              44              21
##   Social.Support Freedom Corruption Generosity Log.of.GDP Healthy.Life
## 1              2       5          4         47         22           27
## 2              4       6          3         22         14           23
## 3              3       3          8         11          7           12
## 4              1       7         45          3         15           13
## 5             15      19         12          7         12           18
## 6             13      11          7         16          8            4
  1. Check the structure of the data frame using str() function:
## 'data.frame':    156 obs. of  11 variables:
##  $ Country        : Factor w/ 156 levels "Afghanistan",..: 44 37 106 58 100 134 133 101 24 7 ...
##  $ Happiness.Rank : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ SD.of.Ladder   : int  4 13 8 9 1 11 18 15 23 10 ...
##  $ Positive.Affect: int  41 24 16 3 12 44 34 22 18 64 ...
##  $ Negative.Affect: int  10 26 29 3 25 21 8 12 49 24 ...
##  $ Social.Support : int  2 4 3 1 15 13 25 5 20 31 ...
##  $ Freedom        : int  5 6 3 7 19 11 10 8 9 26 ...
##  $ Corruption     : int  4 3 8 45 12 7 6 5 11 19 ...
##  $ Generosity     : int  47 22 11 3 7 16 17 8 14 25 ...
##  $ Log.of.GDP     : int  22 14 7 15 12 8 13 26 19 16 ...
##  $ Healthy.Life   : int  27 23 12 13 18 4 17 14 8 15 ...
  1. Total number of rows in dataset: 156
  1. We will now add a Continent column to the dataset for the respective Countries. This will be useful to analyze the data at the Continent / Region level.
# Add Continent column:
asia <- c("Israel", "United Arab Emirates", "Singapore", "Thailand", "Taiwan",
        "Qatar", "Saudi Arabia", "Kuwait", "Bahrain", "Malaysia", "Uzbekistan", "Japan",
        "South Korea", "Turkmenistan", "Kazakhstan", "Turkey", "Hong Kong", "Philippines",
        "Jordan", "China", "Pakistan", "Indonesia", "Azerbaijan", "Lebanon", "Vietnam",
        "Tajikistan", "Bhutan", "Kyrgyzstan", "Nepal", "Mongolia", "Palestinian Territories",
        "Iran", "Bangladesh", "Myanmar", "Iraq", "Sri Lanka", "Armenia", "India", "Georgia",
        "Cambodia", "Afghanistan", "Yemen", "Syria", "Laos")
europe <- c("Norway", "Denmark", "Iceland", "Switzerland", "Finland",
          "Netherlands", "Sweden", "Austria", "Ireland", "Germany",
          "Belgium", "Luxembourg", "United Kingdom", "Czech Republic",
          "Malta", "France", "Spain", "Slovakia", "Poland", "Italy",
          "Russia", "Lithuania", "Latvia", "Moldova", "Romania",
          "Slovenia", "Northern Cyprus", "Cyprus", "Estonia", "Belarus",
          "Serbia", "Hungary", "Croatia", "Kosovo", "Montenegro",
          "Greece", "Portugal", "Bosnia and Herzegovina", "Macedonia",
          "Bulgaria", "Albania", "Ukraine")
north.america <- c("Canada", "Costa Rica", "United States", "Mexico",  
                 "Panama","Trinidad and Tobago", "El Salvador", "Guatemala",
                 "Jamaica", "Nicaragua", "Dominican Republic", "Honduras",
                 "Haiti")
south.america <- c("Chile", "Brazil", "Argentina", "Uruguay",
                 "Colombia", "Ecuador", "Bolivia", "Peru",
                 "Paraguay", "Venezuela")
australia <- c("New Zealand", "Australia")
africa <- c("Libya", "Mauritius", "Nigeria", "Algeria", "Morocco", "Cameroon", "Ghana",
            "Ivory Coast", "Benin", "Congo (Brazzaville)", "Gabon", "South Africa", "Senegal",
            "Somalia", "Namibia", "Niger", "Burkina Faso", "Guinea", "Gambia", "Kenya", "Mauritania",
            "Mozambique", "Tunisia", "Congo (Kinshasa)", "Mali", "Sierra Leone", "Chad", "Ethiopia",
            "Swaziland", "Uganda", "Egypt", "Zambia", "Togo", "Liberia", "Comoros", "Madagascar",
            "Lesotho", "Burundi", "Zimbabwe", "Botswana", "Malawi", "Rwanda", "Tanzania", "Central African Republic",
            "South Sudan")

df.whr$Continent <- NA
df.whr[df.whr$Country %in% australia, ]$Continent <- "Australia"
df.whr[df.whr$Country %in% africa, ]$Continent <- "Africa"
df.whr[df.whr$Country %in% south.america, ]$Continent <- "South America"
df.whr[df.whr$Country %in% north.america, ]$Continent <- "North America"
df.whr[df.whr$Country %in% europe, ]$Continent <- "Europe"
df.whr[df.whr$Country %in% asia, ]$Continent <- "Asia"
  1. Update the levels of factor variable Country in order of Happiness Rank / Ladder so that the levels of the factor variable Country is aligned to the Happiness rank of the country.
  1. From the dataset, we can now extract the Top 10 Happiest countries and store them in a smaller size dataframe:
##        Country Happiness.Rank
## 1      Finland              1
## 2      Denmark              2
## 3       Norway              3
## 4      Iceland              4
## 5  Netherlands              5
## 6  Switzerland              6
## 7       Sweden              7
## 8  New Zealand              8
## 9       Canada              9
## 10     Austria             10
  1. Similarly, we will extract the Bottom 10 Unhappy countries and store them in another smaller size dataframe:
##                      Country Happiness.Rank
## 147                    Haiti            147
## 148                 Botswana            148
## 149                    Syria            149
## 150                   Malawi            150
## 151                    Yemen            151
## 152                   Rwanda            152
## 153                 Tanzania            153
## 154              Afghanistan            154
## 155 Central African Republic            155
## 156              South Sudan            156
  1. We will now identify Countries / Rows with at least 1 null value which indicate number of empty values / columns for the corresponding country. Using the missing values column, we will create another dataframe with only the Countries that have values in all the columns. This step is required to have a dataframe for appropriate plots that will not be impacted due to missing values. The total number of rows in this dataframe without missing values is 139.

3. Analyzing the Data

We have categorized our analysis between All Countries (with no missing values), Top 10 and Bottom 10 ranking Countries and Continents / Regions.

All Countries

We will first understand the attributes that contributes the most and least to the Happiness of a Country. To do that, we will try to plot a correlation matrix of all attributes.

Correlation matrix: All attributes

From the above plot, we conclude below:

  • Social Support, Healthy Life expectancy and Log of GDP contribute the most to a better Happiness rank of the Country.

  • Corruption has the least correlation to Happiness rank of a Country. As the sense of Corruption in people gets higher, the Happiness of the Country will lower.

  • Log of GDP and Healthy Life expectancy are the most correlated attributes to each other.

Top contributors

We will prove the correlation analysis by plotting the pairs plot below:

  • Notice that Happiness Rank of a Country gets linearly impacted by the Social Support, Healthy Life expectancy and Log of GDP of the Country.

Least contributors

  • Similarly, We see that the sense of Corruption and Negative emotion of the people affect the Happiness Score. They do not contribute to a better Happiness Rank.

Top 10 vs Bottom 10

We will now conduct our analysis on Top 10 happiest and Bottom 10 unhappy Countries. We will try to understand how different aspects of a Country impacts its Happiness score.

Numerical Variable: Positive Affect

Categorical Variable: Country

  • The plot will display the ranking of Positive Emotion in the people of the Country that contributes to their Happiness.
  • Smaller number of the Positive Affect rank indicates that it has contributed more to the Happiness of that Country.
  • By looking at these 2 plots, we can compare Top 10 ranked Countries in Happiness and the Bottom 10 ranked Countries.
  • We can clearly see that people from the Happiest countries have given higher rank to the Positive Emotion in them as compared to the Bottom 10 countries.

Plot All Attributes

We will now try to plot all attributes of the Top 10 and Bottom 10 countries.

  • From this we conclude that Countries that have a good Happiness rank have consistent good ranks across their individual attributes.
  • We do observe some outliers like Negative Emotion is higher in Canada, however its Healthy Life expectancy and sense of Freedom increase its Happiness Rank.
  • Similarly, in the Bottom 10 list, we see Haiti and Rwanda having very good sense of Generosity and Freedom respectively among the people. However, their other factors like Log of GDP and Social Support bring down their Happiness rank.

Boxplot: Comparison of all Attributes

  • Here we can point out the outliers in the various attributes of the Countries.
  • We can see that sense of Freedom is spread out over the scale indicating people have different opinions on the sense of Freedom in their respective Countries.
  • The same behaviour is seen for Negative Emotion and Generosity.

  • We conclude here that spread of Happiness rank is similar to Log of GDP and Healthy Life expectancy indicating that the Happiness score of the Top 10 countries are impacted by these attributes.

Continents

We will now analyze how the Happiness score is impacted by Continents or Geographical presence of the Countries.

We have plot below the distribution of some key attributes that impact the Happiness score.

Distributions

Log of GDP

  • The Log of GDP distribution is having higher or worst scores in the African, Asian and South American continent.

  • Europe and Australian continent seem to have better scores for Log Of GDP compared to other continents.

  • The Log of GDP scores vary a lot in the North American countries.

par(mfrow=c(3,2))

# Africa
plot(df.notNA.Africa$Log.of.GDP, type = "h",
     xlim = c(0, 46), ylim = c(0, 156),
     xaxt = "n", yaxt = "n",
     xlab = "x", ylab = "Rank",
     col = "deepskyblue2", main = "Africa: Log of GDP")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Africa$Log.of.GDP), df.notNA.Africa$Log.of.GDP, pch = 16) 
abline(h = 0, col="red")

# Asia
plot(df.notNA.Asia$Log.of.GDP, type = "h",
     xlim = c(0, 46), ylim = c(0, 156),
     xaxt = "n", yaxt = "n",
     xlab = "x", ylab = "Rank",
     col = "deepskyblue2", main = "Asia: Log of GDP")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Asia$Log.of.GDP), df.notNA.Asia$Log.of.GDP, pch = 16) 
abline(h = 0, col="red")

# Europe
plot(df.notNA.Europe$Log.of.GDP, type = "h",
     xlim = c(0, 46), ylim = c(0, 156),
     xaxt = "n", yaxt = "n",
     xlab = "x", ylab = "Rank",
     col = "deepskyblue2", main = "Europe: Log of GDP")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Europe$Log.of.GDP), df.notNA.Europe$Log.of.GDP, pch = 16) 
abline(h = 0, col="red")

# Australia
plot(df.notNA.Aus$Log.of.GDP, type = "h",
     xlim = c(0, 46), ylim = c(0, 156),
     xaxt = "n", yaxt = "n",
     xlab = "x", ylab = "Rank",
     col = "deepskyblue2", main = "Australia: Log of GDP")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Aus$Log.of.GDP), df.notNA.Aus$Log.of.GDP, pch = 16) 
abline(h = 0, col="red")

# North America
plot(df.notNA.NorthAm$Log.of.GDP, type = "h",
     xlim = c(0, 46), ylim = c(0, 156),
     xaxt = "n", yaxt = "n",
     xlab = "x", ylab = "Rank",
     col = "deepskyblue2", main = "North America: Log of GDP")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.NorthAm$Log.of.GDP), df.notNA.NorthAm$Log.of.GDP, pch = 16) 
abline(h = 0, col="red")

# South America
plot(df.notNA.SouthAm$Log.of.GDP, type = "h",
     xlim = c(0, 46), ylim = c(0, 156),
     xaxt = "n", yaxt = "n",
     xlab = "x", ylab = "Rank",
     col = "deepskyblue2", main = "South America: Log of GDP")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.SouthAm$Log.of.GDP), df.notNA.SouthAm$Log.of.GDP, pch = 16) 
abline(h = 0, col="red")

Healthy life expectancy

  • Healthy Life expectancy follows almost similar distribution to Log of GDP.

  • This indicates the correlation between the 2 attributes is very linear.

  • We see that again Africa seems to have the worst Healthy Life expectancy scores while Europe and Australian continents have the best scores.

par(mfrow=c(3,2))

# Africa
plot(df.notNA.Africa$Healthy.Life, type = "h",
     xlim = c(0, 46), ylim = c(0, 156),
     xaxt = "n", yaxt = "n",
     xlab = "x", ylab = "Rank",
     col = "forestgreen", main = "Africa: Healthy Life expectancy")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Africa$Healthy.Life), df.notNA.Africa$Healthy.Life, pch = 16) 
abline(h = 0, col="red")

# Asia
plot(df.notNA.Asia$Healthy.Life, type = "h",
     xlim = c(0, 46), ylim = c(0, 156),
     xaxt = "n", yaxt = "n",
     xlab = "x", ylab = "Rank",
     col = "forestgreen", main = "Asia: Healthy Life expectancy")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Asia$Healthy.Life), df.notNA.Asia$Healthy.Life, pch = 16) 
abline(h = 0, col="red")

# Europe
plot(df.notNA.Europe$Healthy.Life, type = "h",
     xlim = c(0, 46), ylim = c(0, 156),
     xaxt = "n", yaxt = "n",
     xlab = "x", ylab = "Rank",
     col = "forestgreen", main = "Europe: Healthy Life expectancy")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Europe$Healthy.Life), df.notNA.Europe$Healthy.Life, pch = 16) 
abline(h = 0, col="red")

# Australia
plot(df.notNA.Aus$Healthy.Life, type = "h",
     xlim = c(0, 46), ylim = c(0, 156),
     xaxt = "n", yaxt = "n",
     xlab = "x", ylab = "Rank",
     col = "forestgreen", main = "Australia: Healthy Life expectancy")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Aus$Healthy.Life), df.notNA.Aus$Healthy.Life, pch = 16) 
abline(h = 0, col="red")

# North America
plot(df.notNA.NorthAm$Healthy.Life, type = "h",
     xlim = c(0, 46), ylim = c(0, 156),
     xaxt = "n", yaxt = "n",
     xlab = "x", ylab = "Rank",
     col = "forestgreen", main = "North America: Healthy Life expectancy")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.NorthAm$Healthy.Life), df.notNA.NorthAm$Healthy.Life, pch = 16) 
abline(h = 0, col="red")

# South America
plot(df.notNA.SouthAm$Healthy.Life, type = "h",
     xlim = c(0, 46), ylim = c(0, 156),
     xaxt = "n", yaxt = "n",
     xlab = "x", ylab = "Rank",
     col = "forestgreen", main = "South America: Healthy Life expectancy")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.SouthAm$Healthy.Life), df.notNA.SouthAm$Healthy.Life, pch = 16) 
abline(h = 0, col="red")

Corruption

  • The Corruption attribute seems to vary a lot in all the continents.

  • However, Europe and Australian continents have lot of low corruption scores.

par(mfrow=c(3,2))

# Africa
plot(df.notNA.Africa$Corruption, type = "h",
     xlim = c(0, 46), ylim = c(0, 156),
     xaxt = "n", yaxt = "n",
     xlab = "x", ylab = "Rank",
     col = "firebrick2", main = "Africa: Corruption")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Africa$Corruption), df.notNA.Africa$Corruption, pch = 16) 
abline(h = 0, col="red")

# Asia
plot(df.notNA.Asia$Corruption, type = "h",
     xlim = c(0, 46), ylim = c(0, 156),
     xaxt = "n", yaxt = "n",
     xlab = "x", ylab = "Rank",
     col = "firebrick2", main = "Asia: Corruption")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Asia$Corruption), df.notNA.Asia$Corruption, pch = 16) 
abline(h = 0, col="red")

# Europe
plot(df.notNA.Europe$Corruption, type = "h",
     xlim = c(0, 46), ylim = c(0, 156),
     xaxt = "n", yaxt = "n",
     xlab = "x", ylab = "Rank",
     col = "firebrick2", main = "Europe: Corruption")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Europe$Corruption), df.notNA.Europe$Corruption, pch = 16) 
abline(h = 0, col="red")

# Australia
plot(df.notNA.Aus$Corruption, type = "h",
     xlim = c(0, 46), ylim = c(0, 156),
     xaxt = "n", yaxt = "n",
     xlab = "x", ylab = "Rank",
     col = "firebrick2", main = "Australia: Corruption")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Aus$Corruption), df.notNA.Aus$Corruption, pch = 16) 
abline(h = 0, col="red")

# North America
plot(df.notNA.NorthAm$Corruption, type = "h",
     xlim = c(0, 46), ylim = c(0, 156),
     xaxt = "n", yaxt = "n",
     xlab = "x", ylab = "Rank",
     col = "firebrick2", main = "North America: Corruption")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.NorthAm$Corruption), df.notNA.NorthAm$Corruption, pch = 16) 
abline(h = 0, col="red")

# South America
plot(df.notNA.SouthAm$Corruption, type = "h",
     xlim = c(0, 46), ylim = c(0, 156),
     xaxt = "n", yaxt = "n",
     xlab = "x", ylab = "Rank",
     col = "firebrick2", main = "South America: Corruption")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.SouthAm$Corruption), df.notNA.SouthAm$Corruption, pch = 16) 
abline(h = 0, col="red")

Negative Affect

  • The Negative Emotion varies a lot in Asia and Europe.

  • The other continents namely Africa and South America have some higher Negative emotion scores.

par(mfrow=c(3,2))

# Africa
plot(df.notNA.Africa$Negative.Affect, type = "h",
     xlim = c(0, 46), ylim = c(0, 156),
     xaxt = "n", yaxt = "n",
     xlab = "x", ylab = "Rank",
     col = "maroon4", main = "Africa: Negative Affect")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Africa$Negative.Affect), df.notNA.Africa$Negative.Affect, pch = 16) 
abline(h = 0, col="red")

# Asia
plot(df.notNA.Asia$Negative.Affect, type = "h",
     xlim = c(0, 46), ylim = c(0, 156),
     xaxt = "n", yaxt = "n",
     xlab = "x", ylab = "Rank",
     col = "maroon4", main = "Asia: Negative Affect")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Asia$Negative.Affect), df.notNA.Asia$Negative.Affect, pch = 16) 
abline(h = 0, col="red")

# Europe
plot(df.notNA.Europe$Negative.Affect, type = "h",
     xlim = c(0, 46), ylim = c(0, 156),
     xaxt = "n", yaxt = "n",
     xlab = "x", ylab = "Rank",
     col = "maroon4", main = "Europe: Negative Affect")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Europe$Negative.Affect), df.notNA.Europe$Negative.Affect, pch = 16) 
abline(h = 0, col="red")

# Australia
plot(df.notNA.Aus$Negative.Affect, type = "h",
     xlim = c(0, 46), ylim = c(0, 156),
     xaxt = "n", yaxt = "n",
     xlab = "x", ylab = "Rank",
     col = "maroon4", main = "Australia: Negative Affect")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.Aus$Negative.Affect), df.notNA.Aus$Negative.Affect, pch = 16) 
abline(h = 0, col="red")

# North America
plot(df.notNA.NorthAm$Negative.Affect, type = "h",
     xlim = c(0, 46), ylim = c(0, 156),
     xaxt = "n", yaxt = "n",
     xlab = "x", ylab = "Rank",
     col = "maroon4", main = "North America: Negative Affect")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.NorthAm$Negative.Affect), df.notNA.NorthAm$Negative.Affect, pch = 16) 
abline(h = 0, col="red")

# South America
plot(df.notNA.SouthAm$Negative.Affect, type = "h",
     xlim = c(0, 46), ylim = c(0, 156),
     xaxt = "n", yaxt = "n",
     xlab = "x", ylab = "Rank",
     col = "maroon4", main = "South America: Negative Affect")
axis(side = 1, at = 0:46, labels = TRUE)
axis(side = 2, at = seq(0, 156, 10), labels = TRUE)
points(1:length(df.notNA.SouthAm$Negative.Affect), df.notNA.SouthAm$Negative.Affect, pch = 16) 
abline(h = 0, col="red")

  • We conclude that Europe, Australia and North American Countries are among the Happiest countries and continents in that order.
  • Africa is the least happy continent in the world followed by Asia.

Boxplot of Happiness Rank

  • The conclusion above in the distributions can be confirmed by the plot below.
  • Europe, Australia, North America and South America are closest to the lower (better) scores.
  • Haiti is the outlier in North American continent with higher unhappy Score.
  • Africa and Asia are above the mean Happiness Score.

Happiness Rank: World Map

  • The below intensity map of the Happiness score will indicate that lighter the color of the country, the Happier it is in the world.
  • We clearly see that Europe, Australia, North and South America are the Happiest continents.
  • Finland ranks # 1 in the Happiness Score while South Sudan is the least Happy Country in the World.
## 154 codes from your data successfully matched countries in the map
## 2 codes from your data failed to match with a country code in the map
## 89 codes from the map weren't represented in your data

Sampling:

Demonstrate Central Limit Theorem

We will use numerical variable: Happiness Rank

  • Original dataset distribution

  • We will now draw 100 samples from the dataset (with no missing column values).

  • We will then take 10, 20, 30 and 40 mean value samples and plot them to demonstrate the Central Limit Theorem.

## [1] "Sample Size =  10  Mean =  78.84  SD =  14.23"
## [2] "Sample Size =  20  Mean =  79.45  SD =  10.3" 
## [3] "Sample Size =  30  Mean =  79.14  SD =  8.36" 
## [4] "Sample Size =  40  Mean =  79.2  SD =  7.44"
  • With 1000 samples we can see that the different mean value samples form a normal distribution plot.

  • As the Sample size increases, we see that the mean of the sample means remain almost same. However, the Standard Deviation keeps getting smaller indicating that the Normal distribution plot keeps getting narrower.

Sampling methods

1. Simple Random Sampling:

  • In this sampling method we will pull 20 random samples from the dataset (with no missing column values).

  • We will perform the sampling without replacement.

##          Country Happiness.Rank SD.of.Ladder Positive.Affect Negative.Affect
## 2        Denmark              2           13              24              26
## 3         Norway              3            8              16              29
## 10       Austria             10           10              64              24
## 32        Brazil             32          116              69             105
## 38      Slovakia             38           39              53              47
## 43      Colombia             43          120              30              88
## 56       Jamaica             56          102              51              51
## 67      Pakistan             67           53             130             111
## 71       Moldova             71           45             133              67
## 89       Morocco             89          101             110              91
## 97      Bulgaria             97           47             117              13
## 98         Ghana             98          129              92              72
## 107      Albania            107          126              90             108
## 115 Burkina Faso            115           92             115             117
## 126         Iraq            126          147             151             154
## 137        Egypt            137           66             146             124
## 140        India            140           41              93             115
## 142      Comoros            142          143              67             114
## 143   Madagascar            143           77              46              96
## 147        Haiti            147          111             142             119
##     Social.Support Freedom Corruption Generosity Log.of.GDP Healthy.Life
## 2                4       6          3         22         14           23
## 3                3       3          8         11          7           12
## 10              31      26         19         25         16           15
## 32              43      84         71        108         70           72
## 38              21     108        142         70         35           38
## 43              52      56        124        111         74           51
## 56              28      49        130        119         93           55
## 67             130     114         55         58        110          114
## 71              65     128        148         86        109           86
## 89             139      76         84        154         98           79
## 97              18     115        147        112         56           65
## 98             132      91        117         52        114          121
## 107            133      87        134         60         81           40
## 115            116     127         47        125        137          136
## 126            124     130         66         73         64          107
## 137            118     129         89        132         85          101
## 140            142      41         73         65        103          105
## 142            143     148         81         62        143          117
## 143            128     146        116        136        144          111
## 147            146     152         48         20        138          125
##         Continent
## 2          Europe
## 3          Europe
## 10         Europe
## 32  South America
## 38         Europe
## 43  South America
## 56  North America
## 67           Asia
## 71         Europe
## 89         Africa
## 97         Europe
## 98         Africa
## 107        Europe
## 115        Africa
## 126          Asia
## 137        Africa
## 140          Asia
## 142        Africa
## 143        Africa
## 147 North America
  • We will plot a correlation matrix to analyze if our conclusions change based on the sample.

From the above plot, we conclude below:

  • Our original conclusions have remained intact.

  • Social Support, Healthy Life expectancy and Log of GDP are still the most contributing factors for a better Happiness Score.

  • We do see that Countries in the selected sample have higher Negative Affect correlation to Happiness Rank.

  • Log of GDP and Healthy Life expectancy are still the most correlated attributes in the selected sample.

2. Systematic Sampling:

  • In this sampling method, we will pull 20 random samples from the dataset (with no missing column values).

  • We will perform the sampling without replacement.

  • It so happens that we will take the 7th row from the dataset to form the sample dataset.

##                 Country Happiness.Rank SD.of.Ladder Positive.Affect
## 6           Switzerland              6           11              44
## 13               Israel             13           14             104
## 20       Czech Republic             20           20              74
## 31               Panama             31          121               7
## 39  Trinidad and Tobago             39           89              14
## 47            Argentina             47           97              28
## 55              Estonia             55           32              50
## 62              Hungary             62           36              86
## 70               Serbia             70          100             148
## 79               Turkey             79           58             154
## 86           Kyrgyzstan             86           46              58
## 95               Bhutan             95            6              37
## 103 Congo (Brazzaville)            103          152             124
## 111             Senegal            111           44              68
## 119             Georgia            119           51             141
## 126                Iraq            126          147             151
## 133             Ukraine            133           69             131
## 141             Liberia            141          156             103
## 148            Botswana            148          125              87
## 156         South Sudan            156          140             127
##     Negative.Affect Social.Support Freedom Corruption Generosity Log.of.GDP
## 6                21             13      11          7         16          8
## 13               69             38      93         74         24         31
## 20               22             24      58        121        117         32
## 31               48             41      32        104         88         51
## 39               52             29      51        141         41         38
## 47               93             46      54        109        123         55
## 55                6             12      45         30         83         37
## 62               31             51     138        140        100         42
## 70               92             57     124        118         84         71
## 79              121             61     140         50         98         44
## 86                4             45      38        138         36        120
## 95               98             68      59         25         13         95
## 103             136            138      92         60        140        111
## 111              60            106     121         88        130        126
## 119              43            147     104         28        153         87
## 126             154            124     130         66         73         64
## 133              44             56     141        143         66         94
## 141             146            127      94        126        110        150
## 148              65            105      60         54        150         66
## 156             152            148     154         61         85        140
##     Healthy.Life     Continent
## 6              4        Europe
## 13            11          Asia
## 20            31        Europe
## 31            33 North America
## 39            93 North America
## 47            37 South America
## 55            41        Europe
## 62            56        Europe
## 70            48        Europe
## 79            69          Asia
## 86            91          Asia
## 95           104          Asia
## 103          116        Africa
## 111          109        Africa
## 119           84          Asia
## 126          107          Asia
## 133           87        Europe
## 141          126        Africa
## 148          113        Africa
## 156          143        Africa
  • We will plot a correlation matrix to analyze if our conclusions change based on the sample.

From the above plot, we conclude below:

  • Our original conclusions have remained intact.

  • Social Support, Healthy Life expectancy and Log of GDP are still the most contributing factors for a better Happiness Score.

  • We do see that Countries in the selected sample have lower Corruption correlation that indicate that there are some of the high Corruption ranking countries in the selected sample.

  • Log of GDP and Healthy Life expectancy are still the most correlated attributes in the selected sample.

3. Stratified Sampling:

  • In the stratified sampling process, we will Order the dataset (with no missing column values) by the Continent name.

  • We will then pull 40 proportional samples based on the number of Countries belonging to the respective Continents and create our sample dataset.

  • We will perform the sampling with replacement.

## Stratum 1 
## 
## Population total and number of selected units: 43 12 
## Stratum 2 
## 
## Population total and number of selected units: 32 9 
## Stratum 3 
## 
## Population total and number of selected units: 2 1 
## Stratum 4 
## 
## Population total and number of selected units: 39 11 
## Stratum 5 
## 
## Population total and number of selected units: 13 4 
## Stratum 6 
## 
## Population total and number of selected units: 10 3 
## Number of strata  6 
## Total number of selected units 40
##                 Country Happiness.Rank SD.of.Ladder Positive.Affect
## 57            Mauritius             57           94              55
## 57.1          Mauritius             57           94              55
## 122          Mauritania            122           68              94
## 124             Tunisia            124           79             147
## 127    Congo (Kinshasa)            127           78             125
## 129        Sierra Leone            129          153             139
## 136              Uganda            136          148              91
## 139                Togo            139          103             123
## 141             Liberia            141          156             103
## 142             Comoros            142          143              67
## 145             Burundi            145          138              98
## 153            Tanzania            153          122              78
## 34            Singapore             34            5              38
## 41           Uzbekistan             41           99              19
## 52             Thailand             52           81              20
## 54          South Korea             54           57             101
## 74           Tajikistan             74           50             120
## 79               Turkey             79           58             154
## 80             Malaysia             80           12              25
## 95               Bhutan             95            6              37
## 151               Yemen            151           85             153
## 8           New Zealand              8           15              22
## 7                Sweden              7           18              34
## 15       United Kingdom             15           16              52
## 18              Belgium             18            7              57
## 42            Lithuania             42           55             138
## 42.1          Lithuania             42           55             138
## 44             Slovenia             44           54             114
## 48              Romania             48           75              80
## 62              Hungary             62           36              86
## 66             Portugal             66           73              97
## 81              Belarus             81           22             149
## 97             Bulgaria             97           47             117
## 9                Canada              9           23              18
## 12           Costa Rica             12           62               4
## 77   Dominican Republic             77          155              66
## 147               Haiti            147          111             142
## 32               Brazil             32          116              69
## 33              Uruguay             33           88              10
## 47            Argentina             47           97              28
##      Negative.Affect Social.Support Freedom Corruption Generosity Log.of.GDP
## 57                16             54      40         96         37         53
## 57.1              16             54      40         96         37         53
## 122               58             99     151         67        148        117
## 124              132            121     143        101        144         84
## 127               95            107     125        106        127        149
## 129              149            135     116        112         79        145
## 136              139            114      99         95         74        136
## 139              147            149     120         72        131        142
## 141              146            127      94        126        110        150
## 142              114            143     148         81         62        143
## 145              126            152     135         23        149        151
## 153               50            131      78         34         49        125
## 34                 2             36      20          1         21          3
## 41                15             11       1         18         29        104
## 52                35             53      18        131         10         62
## 54                45             91     144        100         40         27
## 74                54            113      86         35         72        123
## 79               121             61     140         50         98         44
## 80                23             97      36        137         27         40
## 95                98             68      59         25         13         95
## 151               75            100     147         83        155        141
## 8                 12              5       8          5          8         26
## 7                  8             25      10          6         17         13
## 15                42              9      63         15          4         23
## 18                53             22      53         20         44         21
## 42                41             17     122        113        124         36
## 42.1              41             17     122        113        124         36
## 44                71             14      13         97         54         34
## 48                62             86      57        146        102         48
## 62                31             51     138        140        100         42
## 66               100             47      37        135        122         39
## 81                36             33     131         37        103         58
## 97                13             18     115        147        112         56
## 9                 49             20       9         11         14         19
## 12                87             42      16         58         75         67
## 77                77             55      43         52         93         69
## 147              119            146     152         48         20        138
## 32               105             43      84         71        108         70
## 33                76             35      30         33         80         52
## 47                93             46      54        109        123         55
##      Healthy.Life     Continent ID_unit      Prob Stratum
## 57             73        Africa       1 0.2460023       1
## 57.1           73        Africa       1 0.2460023       1
## 122           120        Africa      20 0.2460023       1
## 124            67        Africa      22 0.2460023       1
## 127           140        Africa      23 0.2460023       1
## 129           146        Africa      25 0.2460023       1
## 136           127        Africa      28 0.2460023       1
## 139           132        Africa      31 0.2460023       1
## 141           126        Africa      32 0.2460023       1
## 142           117        Africa      33 0.2460023       1
## 145           135        Africa      36 0.2460023       1
## 153           118        Africa      41 0.2460023       1
## 34              1          Asia      45 0.2485407       2
## 41             83          Asia      46 0.2485407       2
## 52             58          Asia      47 0.2485407       2
## 54              9          Asia      48 0.2485407       2
## 74             92          Asia      53 0.2485407       2
## 79             69          Asia      54 0.2485407       2
## 80             59          Asia      55 0.2485407       2
## 95            104          Asia      62 0.2485407       2
## 151           124          Asia      74 0.2485407       2
## 8              14     Australia      76 0.5000000       3
## 7              17        Europe      84 0.2485348       4
## 15             24        Europe      87 0.2485348       4
## 18             26        Europe      90 0.2485348       4
## 42             62        Europe      98 0.2485348       4
## 42.1           62        Europe      98 0.2485348       4
## 44             29        Europe      99 0.2485348       4
## 48             61        Europe     100 0.2485348       4
## 62             56        Europe     104 0.2485348       4
## 66             22        Europe     105 0.2485348       4
## 81             76        Europe     111 0.2485348       4
## 97             65        Europe     114 0.2485348       4
## 9               8 North America     117 0.2739750       5
## 12             28 North America     118 0.2739750       5
## 77             80 North America     128 0.2739750       5
## 147           125 North America     129 0.2739750       5
## 32             72 South America     131 0.2710000       6
## 33             35 South America     132 0.2710000       6
## 47             37 South America     134 0.2710000       6
  • We will plot a correlation matrix to analyze if our conclusions change based on the sample.

From the above plot, we conclude below:

  • Our original conclusions have remained intact.

  • Social Support, Healthy Life expectancy and Log of GDP are still the most contributing factors for a better Happiness Score.

  • Log of GDP and Healthy Life expectancy are still the most correlated attributes in the selected sample.

4. Conclusion

  1. The positive factors Social Support, Healthy Life expectancy and Log of GDP contribute most to happiness of the Country.
  2. Corruption is least contributor to a better Happiness score of a Country. This means that higher the sense of Corruption among the people, lower the Happiness of the Country.
  3. Europe and Australian continents are the Happiest while African countries are on the least Happy score in the world.
  4. The Western Countries are Happier than the Eastern Countries.